INTEGRATED MULTILINGUAL BROWSER


BACKGROUND AND SUMMARY OF THE INVENTION 
The present invention relates generally to the field of electronic communication over a 
computer network. Particularly, the present invention relates to the expansion of multi-lingual 
electronic communication through translation services for documents and messages available 
through the Internet.

The recent surge in media attention to the Internet, and especially the World Wide
Web, coupled with the continuing growth in home PC ownership, has resulted in a growing
diversity of the Internet user population. No longer is the Internet the province of software
experts; thousands of novice users have begun to come online each day. Software like
CompuServe's Web Browser lets users quickly connect to and find useful content online. This
phenomenon is not restricted to the United States or to English-speaking countries. Online
usage in Europe and Asia is growing even more quickly than in the U.S.

While interest in the online world is at a peak, a significant obstacle exists to broad
usage of the Internet by non-English speakers. The vast majority of Internet content is in
English, and is therefore inaccessible to users with other native languages. Translation of
Internet documents by a human translator is not a practical solution for two reasons. First,
human translation is costly and slow. A translator can typically produce 300-400 words per
hour at costs of 12¢ per word or more. Second, in order to have a translator convert Internet
documents to the user's native language, the user would have to download every document he
was interested in to provide it to the translator. This is a time-consuming process, and if the
user knows no English, he will not even be able to assess the relevance of the document before
downloading it. This would result in wasted time and translation costs since, inevitably, some
of the documents selected will not prove to be worthwhile.

The present invention allows non-English speaking Internet users to access and
understand information available from the World Wide Web and related sources. Language
translation software (known as machine translation, or MT) is combined with Internet
software to allow non-English speaking Internet users to quickly generate translations of
online text. The process is automated and, therefore, less costly and time-consuming than
human translation. Advantages of the present invention are explained further in relation to the
following detailed description of the invention, drawings, and claims.


BRIEF DESCRIPTION OF THE DRAWINGS 
Figures 1A and 1B comprise a screen shot of a World Wide Web page;
Figure 2 is an example of a hypertext document; 

Figure 3 is an example of a hypertext document preprocessed according to the method
of the present invention;

Figure 4 illustrates a system for performing machine translation; 
Figures 5A and 5B comprise an example of a preprocessed hypertext document 
translated according to the method of the present invention; 

Figure 6 is an example of a translated hypertext document postprocessed according to
the method of the present invention;

Figures 7A and 7B comprise a screen shot of a World Wide Web page that has been 
translated according to the method of the present invention; 

Figure 8 is a diagrammatic view of one embodiment of the present invention in which 
machine translation is integrated into a Web browser; and 

Figure 9 is a diagrammatic view of one embodiment of the present invention in which
pre-translated Web pages are accessible from a server. 

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Although the detailed description of a preferred embodiment focuses on automatic
translation of World Wide Web pages, the concept is adaptable to documents obtained from
other sources. 

The World Wide Web (WWW or the Web) is a distributed information system that 
may be accessed through a number of sources. It comprises software and a set of
protocols and conventions. Information on the Web may be accessed using a browser
program such as CompuServe's Web Browser. Browsers allow users to read documents and 
to locate documents from other sources. They present an interface for interacting with the 
system and they process requests on behalf of the user. 

Information providers on the WWW make their information available through 
programs that understand the HyperText Transfer Protocol (HTTP). Browsers assist users in
"visiting" Web sites where information is stored. Information is displayed in pages of text and
graphics called "Web Pages." An example of a Web page as viewed through CompuServe's
Web Browser is provided in Figures 1A and 1B. The Web page shown in Figures 1A and 1B
contains both text 14, 18 and graphics 10, 12, 16. The title bar 20, menu options 22, buttons
24, and document information 26 appearing at the top of the screen are part of the browser
used to view the Web page. 

In most cases, information providers make information available through a Web server.
The server responds to information requests by delivering the requested information to the
user's browser for viewing. Some providers may make their information available through a
proxy server that converts information in one format to the format expected and understood
by the browser.

WO 97/18516 PCT/US96/18102

Documents available on the WWW and displayed by browsers are hypertext 
documents. Hypertext is text that contains references (or "links," "hyperlinks," or "hot
spots") to other documents. The reference is similar to a footnote except the referenced
document may be accessed directly from the original document. The related document may be
viewed by selecting or clicking the mouse on the reference. The process of selecting
hyperlinks to view referenced documents may be referred to as "traversing the hyperlinks."
Unlike a footnote, references usually do not appear as shorthand descriptions of related 
documents. Instead, references may be indicated by a combination of graphics, different fonts, 
different colors for the text, underlining, the mouse pointer turning into a hand, etc. The 
referenced documents may reside on different computers at different Web sites. 

Hypertext documents are written in a "markup language" called Hypertext Markup
Language (HTML). HTML actually refers to both a document type and the markup language
that represents instances of the document type. A hypertext document contains general
semantics appropriate for representing display or presentation characteristics as well as
information from a wide range of domains. A hypertext document consists of a sequence or
stream of characters that comprise both data characters and markups. Markups are
syntactically delimited characters (such as "<," ">," "#," etc.) added to the data characters to
define the document's structure. Markups thus have special meanings and may represent such 
things as hypertext, news, mail, documentation, menus of options, and in-line graphics. 
Markups may be combined with other characters or related values to create codes that also 
have special meaning. Data characters are those characters in the document that are not 
codes. 

Figure 2 is the hypertext document that describes the Web page shown in Figures 1A
and 1B. Figure 2 shows the markups and related words (that comprise codes) as well as data
characters that may appear in a hypertext document. For example, the characters "<" and ">"
appearing throughout the document are markups. The characters "<" and ">" combined with
the word "head" ("<head>") 10 may be considered a code. Finally, the text "NLT Home" 10
that is not surrounded by markups or codes may be considered data characters.
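
The distinction between codes and data characters can be illustrated with a minimal Python sketch. It is an illustration only, not the processing software of the invention: the function name `split_stream`, the labels "code" and "data", and the simplifying assumption that every code runs from "<" to the next ">" are all hypothetical.

```python
import re

# Assumption for this sketch: a code is any run of characters that starts
# with "<" and ends at the next ">". Real HTML has more cases (comments,
# entities, attribute values containing ">") that are ignored here.
CODE_PATTERN = re.compile(r"<[^>]*>")


def split_stream(stream):
    """Split a character stream into ("code", text) and ("data", text) pairs."""
    parts = []
    pos = 0
    for match in CODE_PATTERN.finditer(stream):
        if match.start() > pos:
            # Characters between two codes are data characters.
            parts.append(("data", stream[pos:match.start()]))
        parts.append(("code", match.group()))
        pos = match.end()
    if pos < len(stream):
        # Trailing data characters after the last code.
        parts.append(("data", stream[pos:]))
    return parts


for kind, text in split_stream("<head><title>NLT Home</title></head>"):
    print(kind, text)
```

Applied to a title line of this form, the sketch labels the four tags as codes and "NLT Home" as data characters.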

As indicated by the brief description, HTML documents have a well-defined and
documented structure defined by a grammar. The codes in an HTML document convey
important information regarding both the display or presentation of the document itself as well
as related references and commands. Display and presentation information may include color
information, information about graphics that appear on the page, information about text that
appears on the page, etc. An HTML document is structured as a series of elements that are
identified by the language markups and codes. A document includes a head (consisting of a
title and other optional elements) and a body that is a text flow of paragraphs, lists, images,
and other elements. The various parts of the document may be identified by looking at the
markups or codes in the document. For example, referring again to Figure 2, which shows the
hypertext for Figures 1A and 1B, the document head contains the title "NLT Home" 10. An
image contained in the document is identified in the line

"<br><img src-^e:///n|/iowebsrv/server/8 W height=60
width=640></center>" 12.

As may be apparent, the process of translating an HTML document requires
examination of each character in the document. Characters may be examined individually and in
combination to determine whether they are markups, codes, or data characters. To process a
document, the processing software examines the character stream that comprises the
document. The steps needed to translate an HTML document from one language to another
may be summarized as follows:

Step 1. Preprocess the HTML document by placing boundary markers around 
HTML codes to be preserved during the translation process. The 
translation software recognizes the boundary markers and does not 
translate text and symbols appearing between the markers. 
Step 2. Translate the preprocessed HTML document from the original language to 
the target language. 

Step 3. Postprocess the translated HTML document to remove the boundary 
markers. 

Step 1. The codes in an HTML document convey important information describing the
characteristics of the Web page. Referring again to Figure 2, an example of the type of
information contained in a hypertext document is shown. Certain information contained in the
document of Figure 2 may be interpreted by a Web browser so that to the browser user, the
images shown in Figures 1A and 1B appear. Certain information in the hypertext document is
preserved during the translation process so that the translated page has, in general, the same
appearance and behavior as the original page. Because HTML documents have a well-defined
and known structure described by a grammar, automated translation of an HTML document is
possible. The codes in the document may be discerned by the preprocessing software. Special 
boundary markers placed in the document by the preprocessing software indicate to the 
translation software that the intervening text should not be translated. Consequently, the 
resulting page may have the same appearance and behavior as the original page. 

Referring to Figure 3, an example of a preprocessed HTML document is shown. The
HTML document of Figure 3 is the preprocessed version of the HTML document shown in
Figure 2. In this example, the boundary markers used to identify the HTML codes are the
character pairs "{." and ".}". Any character or character combination that does not normally
occur in text may be used as a boundary marker. The line that appeared as
"<head><title>NLT Home</title></head>" in Figure 2 (10) is preprocessed in Step 1 to the line
"{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10). Other lines in the
document are preprocessed similarly.
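
The preprocessing of Step 1 can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the preprocessor of the invention: the name `preprocess` is hypothetical, and a code is assumed to be any run of characters from "<" to the next ">". Only the marker pairs "{." and ".}" come from the description.

```python
import re

OPEN_MARK = "{."   # opening boundary marker named in the description
CLOSE_MARK = ".}"  # closing boundary marker named in the description

# Assumption for this sketch: a code runs from "<" to the next ">".
CODE_PATTERN = re.compile(r"<[^>]*>")


def preprocess(html):
    """Place boundary markers around every code so it can be left untranslated."""
    return CODE_PATTERN.sub(
        lambda match: OPEN_MARK + match.group() + CLOSE_MARK, html
    )


print(preprocess("<head><title>NLT Home</title></head>"))
# prints {.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}
```

A callable is used as the replacement so that no character in a code is interpreted as a regular-expression escape.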

Step 2. Machine Translation (MT) software performs the translation of text from one 
language to another language. There are many commercially available MT software packages. 
Figure 4 is an illustration of a system in which MT software 10 takes as input text in one
language 12 and generates a rough draft translation of the text in another language 14 using an
electronic dictionary 16 and a set of linguistic and/or statistical rules encoded in the program 
18. MT software can perform language conversion operations very quickly; in some cases, at 
speeds of up to 3,000 words per minute. The translated texts are not high quality translations, 
but they are usually adequate for understanding what the document is about. 

Referring to Figures 5A and 5B, an example of a translated HTML document is
shown. The HTML document of Figures 5A and 5B is the translated version of the
preprocessed HTML document shown in Figure 3. As described above, the boundary markers
used to identify the HTML codes are the character pairs "{." and ".}". Consequently, the MT
software ignores all text that falls between the boundary markers. Data characters that are not
surrounded by boundary markers are translated by the MT software. The preprocessed line
that appeared as "{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}" in Figure 3 (10) is
translated in Step 2 to the line "{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}" in
Figure 5A (10).
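
How a translator might honor the boundary markers can be sketched as follows. The word-for-word lookup below is only a stand-in for real MT software, which uses a full dictionary and linguistic or statistical rules; the names `translate_preserving_codes`, `translate_words`, and `TOY_DICTIONARY` are hypothetical, and the single dictionary entry is chosen to mirror the "NLT Home" example.

```python
import re

# One protected span: "{." followed by anything up to the nearest ".}".
PROTECTED = re.compile(r"\{\..*?\.\}", re.DOTALL)
WORD = re.compile(r"[A-Za-z]+")

# Hypothetical one-entry English-to-French dictionary, for illustration only.
TOY_DICTIONARY = {"Home": "Maison"}


def translate_words(text, dictionary):
    """Replace each known word; unknown words pass through unchanged."""
    return WORD.sub(lambda m: dictionary.get(m.group(), m.group()), text)


def translate_preserving_codes(marked, dictionary):
    """Translate only the text outside boundary markers."""
    out = []
    pos = 0
    for match in PROTECTED.finditer(marked):
        # Data characters before this protected span are translated.
        out.append(translate_words(marked[pos:match.start()], dictionary))
        # The protected span (markers and code) is copied verbatim.
        out.append(match.group())
        pos = match.end()
    out.append(translate_words(marked[pos:], dictionary))
    return "".join(out)


line = "{.<head>.}{.<title>.}NLT Home{.</title>.}{.</head>.}"
print(translate_preserving_codes(line, TOY_DICTIONARY))
# prints {.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}
```

Because protected spans are copied verbatim, a word such as "Home" inside a code (for example in a file name) is left alone even though the same word is translated in the visible text.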

Step 3. In the final step, postprocessing software removes boundary markers from the
translated document. Referring to Figure 6, an example of a postprocessed HTML document
is shown. The HTML document of Figure 6 is the postprocessed version of the translated
HTML document shown in Figures 5A and 5B. As described above, the boundary markers
used to identify the HTML codes are the character pairs "{." and ".}". During
postprocessing, these boundary markers are removed. The translated line that appeared as
"{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}" in Figure 5A (10) is postprocessed
in Step 3 to the line "<head><title>NLT Maison</title></head>" in Figure 6 (10). The
postprocessed HTML document of Figure 6 may then be displayed by the browser as shown in
Figures 7A and 7B.
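
Postprocessing can be sketched as a plain removal of the two marker strings. The function name `postprocess` and its parameters are hypothetical, and the sketch assumes, as the description requires of boundary markers, that the marker character pairs do not otherwise occur in the document text.

```python
def postprocess(translated, open_mark="{.", close_mark=".}"):
    """Remove the boundary markers, leaving the codes themselves in place."""
    # Assumption: the markers never occur in the document's own text, so
    # deleting every occurrence restores the original codes exactly.
    return translated.replace(open_mark, "").replace(close_mark, "")


line = "{.<head>.}{.<title>.}NLT Maison{.</title>.}{.</head>.}"
print(postprocess(line))
# prints <head><title>NLT Maison</title></head>
```

If a marker pair could appear in ordinary text, this removal would corrupt that text, which is one reason to choose markers that do not normally occur in documents.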

Figure 8 is a diagrammatic view of one embodiment of the present invention in which
machine translation is integrated into a Web browser. MT software 10 may be combined with
a browser 12 to allow the user 14 to rapidly and automatically translate online documents
from the World Wide Web 16 into his native language. The MT software 10 may be bundled
with the browser 12 to form an integrated multilingual browser. The user 14 of the
multilingual browser 16 selects the desired target language (e.g., French if the user speaks
French), and the Web document retrieved by the browser 18 may be rapidly translated on-the-fly
with a mouse click. The Web Browser 12 then displays for the user 14 the translated
document 20. Optionally, the user may be able to update and edit parts of the MT software's
electronic dictionaries to include terminology common to the Web sites he visits.

Although a document may be translated at the time that a user requests access to the
document, a document may also be "pre-translated" before a user seeks access to it and stored
in a cache for later retrieval. Documents that have been accessed at least once may also be
stored following translation. The advantage of storing documents that have been translated is
that delivery time to the user may be reduced. Although storing documents requires disk
space, it may represent a better use of system resources because documents that are accessed
frequently are translated once rather than every time they are accessed.

Figure 9 is a diagrammatic view of an alternative implementation in which
pre-translated Web pages are stored on a Web server 14. The translation software resides on a
translation server 14 (possibly the same machine as the Web server). Popular Web pages 24
are pre-translated and stored in a cache 28, with additional pages being added as they are
requested by users 20. The cache is a dynamic storage device with a finite capacity. New,
pre-translated pages are added to the cache, but pages will also be removed from the cache if
they are used infrequently or if there are constraints on storage capacity.

In accessing the system, a user 10 sends to the Web Server 14 a request for a specific
page in a specific language 12. The Web Server 14 then sends a request to get the desired
page 16. The method for servicing the request depends on where the page is located. If the
page has been pre-translated 24 and stored in the cache of pages in multiple languages 28, it is
retrieved from the cache 26 and returned to the user in the requested language 30. If the page
has not been pre-translated, then the page is retrieved 20 from the World Wide Web 22,
translated into the requested language, and cached before being sent to the user 30.
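
The request flow described above, together with a cache of finite capacity, can be sketched as follows. This is one possible arrangement under stated assumptions, not the server of Figure 9: the class name `TranslationCache`, the injected `fetch` and `translate` callables, and the choice of least-recently-used eviction are all hypothetical. The description says only that pages are removed when they are used infrequently or when storage is constrained.

```python
from collections import OrderedDict


class TranslationCache:
    """Cache of translated pages keyed by (url, language), with finite capacity."""

    def __init__(self, capacity, fetch, translate):
        self.capacity = capacity    # maximum number of cached pages
        self.fetch = fetch          # hypothetical callable: url -> page text
        self.translate = translate  # hypothetical callable: (text, language) -> text
        self.pages = OrderedDict()  # ordered from least to most recently used

    def get_page(self, url, language):
        key = (url, language)
        if key in self.pages:
            # Cache hit: mark the page as recently used and return it.
            self.pages.move_to_end(key)
            return self.pages[key]
        # Cache miss: retrieve the page, translate it, and cache the result.
        page = self.translate(self.fetch(url), language)
        self.pages[key] = page
        if len(self.pages) > self.capacity:
            # Evict the least recently used page (an assumed policy).
            self.pages.popitem(last=False)
        return page
```

Passing the retrieval and translation steps in as callables keeps the sketch independent of any particular network library or MT package; a frequently requested page is translated once and then served from the cache until it is evicted.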

Translation of Web pages, in either the bundled browser/MT configuration or the Web
Server configuration, requires processing of HTML codes containing reference, command,
and display information. Preferably, the HTML codes are identified prior to translation, then
surrounded by special boundary markers to block the translation process on the codes. The
HTML preprocessor uses its knowledge regarding the markups, codes, data characters and the
structure of HTML documents to determine which codes should be blocked from the
translation process. After translation is complete, a postprocessing program removes the
special boundary markers so that the necessary references, commands, and display
characteristics are available in the translated text.

The primary objective of the present invention is to allow a user of the Internet to read 
hypertext documents that are available only in a language foreign to the user. The readable 
text of the hypertext document is changed in accordance with the user's preferred language.
Steps are taken to preserve the document's appearance and behavior so that the only 
noticeable difference between the original document and the translated document is the 
language of the text. Users may interact with the translated document and reference related 
documents in the same manner that users interact with the original document. 



WHAT IS CLAIMED IS:

1. A method for translating a document, comprising the steps of: 

providing a character stream including codes and data characters in a first 
language; 

transmitting said character stream to a language translator; 
recognizing said codes to prevent translation of said codes by said language 
translator; and 

translating a significant portion of said data characters into a second language 
using said language translator. 

2. The method of claim 1, wherein said codes are Hypertext Markup Language codes. 

3. The method of claim 1, wherein said step of recognizing said codes is performed by a 
language translator preprocessor. 

4. The method of claim 1, wherein boundary markers are placed around said codes to 
prevent translation of said codes by said language translator. 

5. The method of claim 1, wherein said language translator is integrated into a browser 
program. 

6. The method of claim 1, wherein said document is pretranslated.

7. The method of claim 1, further comprising the step of viewing said translated 
HyperText Markup Language document with a browser. 

8. A document translation system, comprising: 

a character stream containing codes and data characters in a first language; 
a preprocessor for marking codes in said character stream; 
a language translator for translating into a second language said data characters 
in said preprocessed character stream; and 

a postprocessor for unmarking said codes in said translated character stream.

9. The system of claim 8, wherein said codes are defined using HyperText Markup 
Language. 

10. The system of claim 8, wherein said preprocessor, said language translator, and said 
postprocessor are integrated into a document browser. 

11. The system of claim 8, wherein said preprocessor, said language translator, and said 
postprocessor are integrated into a Web server process. 

12. The system of claim 8, further comprising a browser for viewing said translated 
HyperText Markup Language document. 

13. A method for translating documents, comprising the steps of: 

providing in a first language a document containing display and reference codes 
and data characters exclusive of said display and reference codes; 

viewing and interacting with said document in said first language; 

translating said data characters to a second language; and 

viewing and interacting with said translated document in substantially the same 
manner as said document in said first language. 